Fine-tuning the polynomial model
Let’s address a couple of the to-dos from the previous pages: check the degree of the polynomial with a grid search, and try out other climatic variables, playing a bit with variable selection.
Tune polynomial degree
First, let’s fine-tune the degree of the polynomial. From the initial scatter plots, a polynomial of degree 2 or 3 looks adequate. But let’s run a grid search on that.
It seems degree 4 would be best. Of course, there is a small variation depending on the metric/scorer used, but overall 4 seems best. The differences are rather small, but there is still this tension between MAE and RMSE. A model that fits better in the warmer season, and therefore has smaller errors on most points and a small MAE, tends to have a few very large errors, which inflate the RMSE. And the other way around.
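A grid search over the polynomial degree could look roughly like the sketch below. It assumes a scikit-learn pipeline with the step names `polynomial` and `model` (as in the pipelines shown on this page); the data is synthetic and only stands in for the real series, and the candidate degrees and the scorer are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): one predictor, cubic relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.5, size=300)

pipeline = Pipeline(
    steps=[
        ("polynomial", PolynomialFeatures()),
        ("model", LinearRegression()),
    ]
)

# The step name prefixes the parameter name: <step>__<parameter>.
param_grid = {"polynomial__degree": [1, 2, 3, 4, 5]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the negated MAE
    cv=5,
)
search.fit(X, y)

best_degree = search.best_params_["polynomial__degree"]
print("best degree:", best_degree)
print("cross-validated MAE:", -search.best_score_)
```

Swapping the `scoring` argument for another metric is enough to see how much the preferred degree depends on the scorer.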
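The tension between MAE and RMSE can be illustrated with two made-up error vectors (not taken from the models on this page): one with mostly tiny errors and a single large one, and one with uniformly moderate errors.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Model A: nine very small errors and one large miss (made-up numbers).
errors_a = [0.1] * 9 + [5.0]
# Model B: the same moderate error everywhere (made-up numbers).
errors_b = [1.0] * 10

for name, errors in [("A", errors_a), ("B", errors_b)]:
    print(f"model {name}: MAE={mae(errors):.2f}  RMSE={rmse(errors):.2f}")
```

Model A wins on MAE but loses on RMSE, because squaring gives the single large miss far more weight than the nine small errors.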
Variable selection
Now let’s try a brute-force approach to variable selection: a not-so-thoughtful but quick and effective way to try out other climatic variables and check their associations with relative humidity and with each other, to see whether some of them (humidity, for instance) carry a real signal or just noise.
Just let it crunch through a bunch of variable combinations. Many of them will be nonsensical or irrelevant, but it is fast to write, and the machine does the work rather than me.
From topic knowledge and the first correlations observed, we would expect mostly temperature to play a key role in the model. Yet other variables, such as humidity, pressure or dew point, could also be relevant. So let’s throw all of that into a grid search and see what it spits out.
- tt: air temperature at 2 m height, °C
- rf_tu: relative humidity, %
- td: dew point temperature, °C
- vp_std: calculated hourly values of vapour pressure, hPa
- tf_std: calculated hourly values of wet-bulb temperature, °C
- p_std: hourly values of air pressure, hPa
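The brute-force idea amounts to enumerating every non-empty subset of the candidate columns and handing that list to the grid search. A minimal sketch follows; the names `tt_tu_mean`, `rf_tu_mean` and `tf_std_mean` appear in the pipelines on this page, while `td_mean`, `vp_std_mean` and `p_std_mean` are assumed names for the remaining variables.

```python
from itertools import combinations

candidate_columns = [
    "tt_tu_mean",   # air temperature (name used on this page)
    "rf_tu_mean",   # relative humidity (name used on this page)
    "tf_std_mean",  # wet-bulb temperature (name used on this page)
    "td_mean",      # dew point (assumed column name)
    "vp_std_mean",  # vapour pressure (assumed column name)
    "p_std_mean",   # air pressure (assumed column name)
]

# Every non-empty subset, from single variables up to all six at once.
column_subsets = [
    list(subset)
    for size in range(1, len(candidate_columns) + 1)
    for subset in combinations(candidate_columns, size)
]

# Parameter grid entry for a pipeline step called "vars" with a "columns" parameter.
param_grid = {"vars__columns": column_subsets}

print(len(column_subsets), "variable combinations")
```

With six candidates this gives 2^6 - 1 = 63 combinations, which is cheap for a linear model; the count doubles with every extra variable, so the approach does not scale far.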
Well, humidity and the other climatic variables only improve the model marginally. A proper mediation analysis would still be in order, but this at least sheds some light on it. Interestingly, some models without temperature but with the set of other climatic variables almost equal the performance of the best model with temperature. Overall, it seems that at least temperature, humidity and pressure should be considered. Yet they tend to be collinear, so they cannot all simply be used together in this kind of model.
One last brute-force approach for today: let the grid search automatically choose the best model.
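A quick way to put a number on collinearity between two predictors is the Pearson correlation and the variance inflation factor (VIF) derived from it; for exactly two predictors, VIF = 1 / (1 - r²). The sketch below uses deterministic synthetic series, not the station data, in which a wet-bulb-like variable is constructed to follow temperature closely.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic series (assumption): the second variable is almost a rescaled copy of the first.
temperature = [0.5 * i for i in range(40)]
wet_bulb = [0.8 * t + 0.3 * math.sin(t) for t in temperature]

r = pearson(temperature, wet_bulb)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor case only

print(f"correlation: {r:.3f}")
print(f"VIF: {vif:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that the coefficients of the affected variables are poorly determined. With more than two predictors, r² has to be replaced by the R² of regressing each predictor on all the others.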
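A combined search over variable sets and polynomial degree with several scorers might be set up as sketched below. `ColumnSelector` here is a hypothetical minimal stand-in written for this example (the class actually used on this page is not shown), the data is synthetic, and refitting on MAE is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps only the named DataFrame columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# Synthetic stand-in data (assumption): the target depends on temperature only.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame(
    {
        "tt_tu_mean": rng.uniform(-5, 30, n),
        "rf_tu_mean": rng.uniform(30, 100, n),
    }
)
y = 0.05 * X["tt_tu_mean"] ** 2 + rng.normal(scale=1.0, size=n)

pipeline = Pipeline(
    steps=[
        ("vars", ColumnSelector()),
        ("polynomial", PolynomialFeatures()),
        ("model", LinearRegression()),
    ]
)

param_grid = {
    "vars__columns": [["tt_tu_mean"], ["rf_tu_mean"], ["tt_tu_mean", "rf_tu_mean"]],
    "polynomial__degree": [1, 2, 3],
}

# Several scorers are evaluated; `refit` names the one that picks the winner.
scoring = {
    "MAE": "neg_mean_absolute_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}

search = GridSearchCV(pipeline, param_grid, scoring=scoring, refit="MAE", cv=5)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
print(search.best_params_)
print(results[["params", "mean_test_MAE", "mean_test_RMSE", "mean_test_R2"]])
```

Since the winner depends on the metric passed to `refit`, it is worth inspecting the full results table rather than only `best_params_`.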
Best parameters
{'polynomial__degree': 4, 'vars__columns': ['tf_std_mean']}
Pipeline(steps=[('vars', ColumnSelector(columns=['tf_std_mean'])),
('polynomial', PolynomialFeatures(degree=4)),
('model', LinearRegression())])
[Interactive results table, not rendered here: one row per combination of polynomial degree (param_polynomial__degree) and variable set (param_vars__columns), with the mean test scores for D2 (D2 Absolute Error Score), EVS (Explained Variance Score), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), MSE (Mean Squared Error), MeAE (Median Absolute Error), Pinball (Mean Pinball Loss), R2 (Coefficient of Determination) and RMSE (Root Mean Squared Error).]
Assessment of the best model in that brute force approach
Metrics

[Interactive metrics table, not rendered here: train and test scores for the single split and for cross-validation.]
Scatter plot matrix
Observed vs. Predicted and Residuals vs. Predicted
Check the residuals to assess the goodness of fit. Check for …
- white noise or is there a pattern?
- heteroscedasticity?
- non-linearity?
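Two of these checks can also be approximated numerically alongside the plots. The sketch below uses made-up predictions and residuals, and the helper names are invented for this example. It computes a lag-1 autocorrelation of the residuals (white noise should be near zero) and compares the residual spread in the lower and upper half of the predictions (a ratio far from 1 hints at heteroscedasticity).

```python
import statistics


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return numer / denom


def spread_ratio(predicted, residuals):
    """Residual std. dev. in the upper half of predictions divided by the lower half."""
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    half = len(ordered) // 2
    return statistics.pstdev(ordered[half:]) / statistics.pstdev(ordered[:half])


# Made-up example: residuals alternate in sign and grow with the prediction.
predicted = [float(i) for i in range(1, 41)]
residuals = [((-1) ** i) * 0.05 * p for i, p in enumerate(predicted)]

print("lag-1 autocorrelation:", round(lag_autocorrelation(residuals), 3))
print("spread ratio (upper/lower):", round(spread_ratio(predicted, residuals), 2))
```

These are crude screening numbers, not formal tests; they complement the residual plots rather than replace them.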
Normality of Residuals:
Check for …
- Are residuals normally distributed?
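Besides a histogram or Q-Q plot, a quick moment-based check is to look at the skewness and excess kurtosis of the residuals, both of which are close to zero for a normal distribution. The sketch below uses simulated residuals rather than the residuals of the models on this page.

```python
import random


def central_moment(values, order):
    """Central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Sample excess kurtosis: 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


random.seed(0)
# Simulated residuals (assumption): one normal sample, one clearly right-skewed sample.
normal_residuals = [random.gauss(0.0, 1.0) for _ in range(2000)]
skewed_residuals = [random.expovariate(1.0) for _ in range(2000)]

for name, sample in [("normal", normal_residuals), ("skewed", skewed_residuals)]:
    print(f"{name}: skewness={skewness(sample):.2f}  excess kurtosis={excess_kurtosis(sample):.2f}")
```

Values far from zero suggest non-normal residuals. A formal test such as Shapiro-Wilk would be the next step if the distinction matters.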




Leverage
Scale-Location plot


Residuals Autocorrelation Plot


Residuals vs Time
Compare models
So let’s compare the models with and without humidity.
Metrics
Single split
Metrics based on the test set of the single split
Cross validation
Predictions, residuals, observed
Time vs. Predicted and Observed
Time vs. Residuals
Model details
Pipeline(steps=[('vars', ColumnSelector(columns=['tt_tu_mean', 'rf_tu_mean'])),
('polynomial', PolynomialFeatures()),
('model', LinearRegression())])
Pipeline(steps=[('vars', ColumnSelector(columns=['tf_std_mean'])),
('polynomial', PolynomialFeatures(degree=4)),
('model', LinearRegression())])
Pipeline(steps=[('vars', ColumnSelector(columns=['tt_tu_mean', 'tf_std_mean'])),
('polynomial', PolynomialFeatures(degree=4)),
('model', LinearRegression())])
A nice improvement over the baseline model. The model without humidity is slightly better: the gain is rather small, but consistent across all metrics.
TODOs
Sorry, I couldn’t reply earlier. We were out and only got back a few minutes before the storm.
We probably haven’t practised as much as we should have, but we did try out a few different routes to school. In the end we decided that they will walk from here via Edelweißstraße to Kaiserblickstraße, then past your house, and turn left towards the traffic light at the Sparkasse. After that we walk past the ice cream parlour, cross the street at the push-button traffic light and then go up to the school.
Are you doing something similar?
I don’t really like the junction at the Sparkasse. I find it somewhat dangerous. But I find the junction at Netto far more dangerous, and the traffic light at Prechtl, although safer in my opinion, makes the route longer. Besides, walking along the narrow pavement beside the main road isn’t particularly safe either. So …
We plan to accompany the children for the first few days anyway and then see how it goes.
And what have you been thinking? Maybe we can organise something together.